Semantic Domains in Computational Linguistics
Authors
Abstract
Ambiguity and variability are two basic and pervasive phenomena characterizing lexical semantics. Their pervasiveness requires Natural Language Processing systems to be equipped with computational models that represent them in the application domain. In this work we introduce a computational model for lexical semantics based on Semantic Domains. This concept is inspired by the “Theory of Semantic Fields”, proposed in structural linguistics to explain lexical semantics. The main property of Semantic Domains is lexical coherence, i.e. the tendency of domain-related words to co-occur in texts. This allows us to define automatic acquisition procedures for Domain Models from corpora, and the acquired models provide a shallow representation of lexical ambiguity and variability. Domain Models have been used to define a similarity metric among texts and terms in the Domain Space, where second-order relations are reflected. Topic similarity estimation is at the basis of text comprehension, allowing us to define a very general domain-driven methodology. The basic argument we put forward to support our domain-based approach is that the information provided by the Domain Models can be profitably used to boost the performance of supervised Natural Language Processing systems for many tasks. In fact, Semantic Domains allow us to extract domain features for texts, terms and concepts. The obtained index, adopted by the Domain Kernel to estimate topic similarity, preserves the original information while reducing the dimensionality of the feature space. The Domain Kernel is used to define a semi-supervised learning algorithm for Text Categorization that achieves state-of-the-art results while decreasing by one order of magnitude the quantity of labeled texts required for learning. We also apply Domain Models to a Term Categorization task, noticeably improving the prediction accuracy on domain-specific terms. Because the Domain Space represents terms and texts together, we can define an Intensional Learning schema for Text Categorization, in which categories are described by means of discriminative words instead of labeled examples, achieving performance close to human agreement. We then investigate the role of domain information in Word Sense Disambiguation, developing both an unsupervised and a supervised approach that rely strongly on the notion of Semantic Domain. The former is based on the lexical resource WordNet Domains, and the latter exploits both sense-tagged and unlabeled data to model the relevant domain distinctions among word senses. Our supervised approach improves the state-of-the-art performance in many tasks for different languages, while appreciably reducing the amount of sense-tagged data required for learning. Finally, we present a multilingual lexical acquisition procedure to obtain Multilingual Domain Models from comparable corpora. We exploit such models to approach a Cross Language Text Categorization task, achieving very promising results that largely surpass a baseline.
Keywords: Lexical Semantics, Word Sense Disambiguation, Text Categorization, Multilinguality, Kernel Methods
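As a rough illustration of the idea of comparing texts in a Domain Space, the following minimal sketch maps bags of words to domain vectors and compares them by cosine similarity. It is not the method of the work summarized above: the abstract does not say how Domain Models are acquired or how the Domain Kernel is defined, so the table `DOMAIN_MODEL`, its two domains, all weights, and the function names are hypothetical and invented for illustration.

```python
import math

# Hypothetical toy Domain Model: each term has a relevance weight per domain.
# Columns are (MEDICINE, COMPUTER_SCIENCE); all values are invented.
DOMAIN_MODEL = {
    "virus": (0.6, 0.4),     # ambiguous term: relevant to both domains
    "hiv": (1.0, 0.0),
    "aids": (1.0, 0.0),
    "laptop": (0.0, 1.0),
    "software": (0.0, 1.0),
}
NUM_DOMAINS = 2


def to_domain_space(tokens):
    """Map a bag of words to a domain vector by summing the term rows."""
    vec = [0.0] * NUM_DOMAINS
    for tok in tokens:
        row = DOMAIN_MODEL.get(tok.lower())
        if row is None:
            continue  # terms outside the model contribute nothing
        for i, weight in enumerate(row):
            vec[i] += weight
    return vec


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def domain_similarity(text_a, text_b):
    """Topic similarity of two whitespace-tokenized texts in the domain space."""
    return cosine(to_domain_space(text_a.split()), to_domain_space(text_b.split()))
```

In this toy setting, `domain_similarity("hiv", "aids")` is high even though the two texts share no word, because both terms map to the same domain; this is one way to read the "second-order relations" mentioned in the abstract. The two-dimensional vectors also show the dimensionality reduction relative to a vocabulary-sized bag-of-words space.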
Similar resources
Semantic Domains and Linguistic Theory
This paper is about the relations between the concept of Semantic Domain and the “Theory of Semantic Fields”, a structural model for lexical semantics proposed by Jost Trier at the beginning of the last century. The main limitation of Trier’s notion is that it does not provide an objective criterion to aggregate words around fields, making the overall model too vague, and thus not useful for ...
A Simple Metric to Measure Semantic Overlap between Models: Application and Visualization
This paper investigates a fairly simple but easily automatable metric for measuring the semantic overlap between models as a proxy to the degree of overlap between their domain coverage. Such a metric is very useful when evaluating competing models with fully or partially overlapping domains, be it for purposes of model integration, re-use or selection. The proposed metric is based on the seman...
Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations
The main task of tokenization is to divide the sentences of a text into their constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical writing chain that is an independent semantic unit. Tokenization occurs at the word level, and the extracted units can be used as input to other components such as a stemmer. The requirement to create...
A Categorial Framework for Composition in Multiple Linguistic Domains
We describe a computational framework for a grammar architecture in which different linguistic domains such as morphology, syntax, and semantics are treated not as separate components but compositional domains. Word and phrase formation are modeled as uniform processes contributing to the derivation of the semantic form. The morpheme, as well as the lexeme, has lexical representation in the for...
Verbs in Applied Linguistics Research Article Introductions: Semantic and syntactic analysis
This study aims to investigate the semantic and syntactic features of verbs used in the introduction section of Applied Linguistics research articles published in Iranian and international journals. A corpus of 20 research article introductions (10 from each journal) was used. The corpus was analysed for the syntactic features (tense, aspect and voice) and semantic meaning of verbs. The finding...